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We study a nonparametric method that estimates the structure of 
a discrete undirected graphical model from data. We assume that the 
distribution generating the data smoothly evolves over time and that 
the given sample is not identically distributed. Under the assumption 
that the underlying graphical model is sparse, the method recovers 
the structure consistently in the high dimensional, low sample size 
setting. 

1. Introduction. Consider the following real world problems: 

• Analysis of gene regulatory networks. Suppose that we have a set of 
n microarray measurements of gene expression levels, obtained at dif- 
ferent stages during the development of an organism or at different 
times during the cell cycle. Given this data, biologists would like to 
get insight into dynamic relationships between different genes and how 
these relations change at different stages of development. The problem 
is that at each time point there is only one or at most a few measure- 
ments of the gene expressions; and a naive approach to estimating the 
gene regulatory network, which uses only the data at the time point 
in question to infer the network, would fail. To obtain a good estimate 
of the regulatory network at any time point, we need to leverage the 
data collected at other time points and extract some information from 
them. 

• Analysis of stock market. In a finance setting, we have values of dif- 
ferent stocks at each time point. Suppose, for simplicity, that we only 
measure whether the value of a particular stock is going up or down. 
We would like to find the underlying transient relational patterns be- 
tween different stocks from these measurements and get insight into 
how do these patterns change over time. Again, we only have one mea- 
surement at each time point and we need to leverage information from 
the data obtained at nearby time points. 

• Understanding social networks. There are 100 Senators in the U.S. 
Senate and each can cast a vote on different bills. Suppose that we are 
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given n voting records over some period of time. How can one infer the 
latent political liaisons and coalitions among different senators and the 
way these relationships change with respect to time and with respect 
to different issues raised in bills just from the voting records? 

What is common to the above described problems is that they all con- 
cern with estimating a sequence of time-specific latent relational structures 
between a fixed set of entities (i.e., variables), from a time series of observa- 
tion data of entities states; and the relational structures between the entities 
are time evolving, rather than being invariant throughout the data collec- 
tion period as commonly assumed in nearly all previous work on structure 
estimation such as [5, 19, 21, 22]. Typically, the available data for the prob- 
lem are very scarce, with only one or at most a few measurements per time 
point corresponding to any particular latent structure; and the data are very 
high-dimensional, with the total number of observations small compared to 
the total number of potential relations, which make the problem of struc- 
ture estimation even more challenging than the static case studied recently 
by [21]. 

A popular model for the relational structure over a fixed set of entities that 
is widely studied is the Markov random field (MRF) [16, 26]. Let G = (V, E) 
represent a graph, of which V denotes the set of vertices, and E denotes the 
set of edges over vertices. Depending on the specific application of interest, 
a node u G V can represent a gene, a stock, or a social actor, and an 
edge (u, v) G E can represent a relationship (e.g., correlation, influence, 
friendship) between actors u and v. Let X = (Xi, . . . , X p )', where p = |V|, be 
a random vector of nodal states following a probability distribution indexed 
by 9 G O. Under a MRF, the nodal states X^s are assumed to be discrete, 
i.e., X u G X = {s\, . . . , Sfc}, and the edge set E C V x V encodes certain 
conditional independence assumptions among components of the random 
vector X, for example, the random variable X u is conditionally independent 
of the random variable X v given the rest of the variables if (u, v) G" E. 
Under the special case of binary nodal states, e.g., X u G X = {—1, 1}, and 
assuming pairwise potential weighted by 6 UV for all (u, v) G E and 9 UV = for 
all (u, v) G" E, the joint probability of X = x can be expressed by a simple 
exponential family model: Pe(x) = ^ exp{J2 u <v^uvX u x v }, also known as 
the Ising model, where Z denotes the partition that is usually intractable to 
compute. A number of recent papers have studied in depth how to estimate 
this model from data that are assumed to be i.i.d. samples from the model, 
and the asymptotic guarantee of the estimator [5, 21]. In particular, an 
important focus has been on the problem of structure estimation of the 
graph topology represented by E. It has been shown that under certain 
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variable conditions, it is possible to obtain an estimator of the edge set E 
that achieve a property known as sparsistency [21], which refers to the case 
where a consistent estimator of E can be attained when the true degree 
(i.e., number of neighbors) of each node is much smaller than the size of the 
graph p. 

In this paper, we are interested in learning the graph structures of MRFs 
from observational data, but under a more demanding scenario where the 
data {x*} are not i.i.d. samples from a time-invariant MRF, but from a se- 
ries of time-evolving MRFs {Pfft(-)} t £T n , where T n = {1/n, 2/n, . . . , 1} is the 
time index set; and our goal is to estimate the sequence of graphs {G t }t£T n 
underlying each observation x* ~ Fgt in the time series, rather than a single 
static graph G underlying P# over all time points. Under the traditional as- 
sumption of data sampled i.i.d. from an invariant P#, structural estimation 
of a MRF can be cast as a neighborhood selection problem for each node in 
the graph based on a £i-norm regularized regression procedure, of which the 
theoretical guarantees have been recently thoroughly studied [21], as we re- 
view shortly. We instead focus on estimating the graph structures from a set 
of n independent, high-dimensional observations which are NOT identically 
distributed, which is arguably a more realistic characteristic of the data. 
We will assume that the probability distribution generating observations 
changes smoothly over time, which is a critical assumption that allow us to 
infer graphs with time-varying structure and get insight into the dynamics 
of the relational changes. Because of this more general problem we are near 
the extremum of the high-p/low-n scenario for high-dimensional inference in 
the traditional sense, (i.e., n is approaching 1, corresponding to as few as 1 
instance of x per time-specific MRF), it is intriguing to ask, can we reliably 
estimate the changing graph structure and, if so, under what conditions? 

It is noteworthy that the problem of the graph structure estimation is 
quite different from the problem of (value-) consistent estimation of the 
unknown parameter 6 that indexes the distribution. In general, the graph 
structure estimation requires a more stringent assumptions on the under- 
lying distribution and the parameter values. For example, observe that a 
consistent estimator of in the Euclidean distance does not guarantee a 
consistent estimation of the graph structure, encoded by the non-zero pat- 
ter of the estimator. In the motivating problems that we started with, the 
main goal is to understand the interactions between different actors. These 
interactions are more easily interpreted by a domain expert than the nu- 
merical values of the parameter vector 6 and have potential to reveal more 
information about the underlying process of interest. This is especially true 
in situations where there is little or no domain knowledge and one is inter- 
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ested in obtaining casual, preliminary information. 

1.1. The Model. Given n observations from time series of nodal states 
{x*}, which are assumed for simplicity to be equidistant in time, we write 
the time index set as T n = {1/n, 2/n, . . . ,1}, and denote by T> n = {x* ~ 
¥gt\t G T n } a sample, where each data point comes from a different discrete 
time step and is distributed according to a distribution P t indexed by t . 
Specifically, we assume that the p-dimensional random vector X* takes values 
in { — 1, 1} P and the probability distribution takes the following form: 



where Z(6 t ) is the partition function, 6 t G is the parameter vector and 
G = (V,E l ) is an undirected graph representing certain conditional inde- 
pendence assumptions among subsets of the p-dimensional random vector 
X*. The problem we are addressing is: 

For every given time point r G T n , estimate the graph structure 
G T associated with ¥gr, given the observations T> n . Indeed, our 
algorithm and analysis apply more generally to r G [0, 1], i.e., the 
estimator does not have to be overlapping in time with the samples. 

Since we are primarily interested in a situation where the total number 
of observation n is small compared to the dimension p, and at each time 
point there is only one realization of the random variable, we will have to 
impose some regularity conditions on the distribution generating the obser- 
vations and on the rate of change of parameter l with respect to time in 
order to reliably estimate the graph structure G T . We will impose two nat- 
ural assumptions: the sparsity of the graphs {G*}f G -r n i an d the smoothness 
of the parameters l as functions of time, which we give in Section 3. Com- 
pared to the estimation of the graph structure from an i.i.d. sample [21], the 
main technical difficulty we have to deal with is the graph structure that 
changes over time. In order to analyze the problem in a sensible asymptotic 
framework, we need to assume that the graph structure "locally" does not 
change too much and, as the sample size grows, there are enough samples 
to estimate the graph structure in some small neighborhood. 

The model given in Eq. (1) can be thought of as a nonparametric ex- 
tension of conventional MRFs, in the similar way as the varying-coefficient 
models [8, 17] are thought of as an extension to the linear regression models. 
The difference between the model given in Eq. (1) and an MRF model is 
that our model allows for parameters to change, while in MRF the param- 
eters are considered fixed. Allowing parameters to vary over time increases 
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the expressiveness of the model, and make it more suitable for longitudi- 
nal network data. For simplicity of presentation, in this paper we consider 
time- varying MRFs with only pairwise potentials as in Eq. (1). Note that in 
the case of discrete MRFs there is no loss of generality by considering only 
pairwise interactions, since any MRF with higher-order interactions can be 
represented with an equivalent MRF with pairwise interactions [26]. 

1.2. Related Work. As mentioned in the introduction, most existing work 
on structure estimation assumes a fixed graphical model over i.i.d. data. In 
addition to the MRF model mentioned above for discrete cases, the Gaussian 
graphical model (GGM) has been a standard model for continuous-valued 
nodal states, in which the state vector x follows a multivariate Gaussian 
distribution JV(0, £), and the non-zero pattern of the concentration matrix 
encodes the structure of the GGM. Under the assumption that the graph 
structure does not change with time, there has been a large (and growing) 
body of literature that deals with estimation of the concentration matrix 
from i.i.d. observational data. These methods can be roughly summarized 
into two categories: consistent estimation of the concentration matrix in a 
given matrix norm, and consistent estimation of the non-zero pattern of 
the concentration matrix. The first group of methods include penalized es- 
timation methods [2, 11, 12, 14, 23, 28], while the second group of methods 
includes using a hypothesis testing [10] and penalized methods [19, 20, 22]. 
A number of authors have also analyzed regularized estimation of the co- 
variance matrix [3, 4, 18, 25, 27]. 

The most relevant work in the literature related to this paper is the esti- 
mation of time- varying Gaussian graphical models considered in [29] , where 
the authors assume that the observations A/"(0, £*) are independent, 
but not identically distributed, realizations of a multivariate Gaussian dis- 
tribution whose covariance matrix varies with time. This model represents 
a counterpart in the continuous domain of our model defined in Eq. (1). 
The problem of consistent, in the Frobenius norm, estimation of covariance 
and concentration matrix is addressed under the assumption that the con- 
centration matrix is sparse and that the covariance matrix varies smoothly 
with time. Note, however, that the problem of consistent estimation of the 
non-zero patterns in the concentration matrix is not addressed there. In 
[29], the estimator S r of the covariance matrix S T is obtained by maximiz- 
ing the log-likelihood of the data defined by a nonparametric estimate of 
the sample covariance matrix. However, this maximum likelihood approach 
does not carry over to the case of discrete MRFs concerned in this paper 
due to intractability of evaluation of the log-partition function. 
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In the case of discrete MRFs, direct maximization of the log-likelihood is 
not tractable. When the structure of the graph is assumed to be invariant, 
[21] use a surrogate function which decomposes across different nodes and 
as a result can be maximized efficiently. It can be shown that the resulting 
estimator can consistently recover the static graph structure. Our method 
can be thought of as a nonparametric extension to the approach in [21], 
which allows for graph structure to change over time. Other work on graph 
structure estimation in discrete MRFs include score based searches [7, 24], 
which are limited to restricted classes of graphs due to the combinatorial 
explosion of the search space of graphs [6] , minimizing the Kullback-Leibler 
divergence [1] and other pseudo-likelihood methods [5, 9]. 

1.3. Highlights of This Paper. We first propose an algorithmic strategy 
which can be understood as a nonparametric extension of the method de- 
veloped in [21]. The time- varying graph estimation problem is decomposed 
into a sequence of neighborhood estimation problems, one for each node, 
at every time point in question. We formulate the time-specific neighbor- 
hood estimation as a penalized convex optimization problem, which can 
be efficiently solved. Running the estimation procedure for each node and 
combining the estimates gives the resulting estimate of the overall graph 
structure at a time point. The decomposition of the graph structure estima- 
tion into the sequence of neighborhood estimation problems has previously 
been successfully used in [19, 21]. 

To estimate the neighborhood of each node at an arbitrary time point of 
interest, we express the conditional log-likelihood of each node given the rest 
of nodes in the graph as a generalized time- varying coefficient model [13], 
and maximize the l\ penalized conditional log-likelihood. Using a kernel to 
reweigh samples from different time points with respect to the time point in 
question, we approximate all the time-varying coefficient functions locally 
by a constant coefficient function pertaining to the time point for which 
the graph is being estimated, and the t\ penalty is introduced to perform 
model selection by shrinking the estimates towards zero and creating sparse 
estimates, which corresponds to estimated neighborhoods with few edges. 
The immediate benefit of this approach is that we can readily use the existing 
fast algorithms for the l\ penalized methods. Details of this algorithm are 
given in Section 2. 

The main contribution of the paper is to provide theoretical guarantees 
of the proposed estimation procedure. In particular, we establish conditions 
under which the procedure is able to recover the underlying time-varying 
graph structure consistently. We show that the following two conditions are 
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sufficient for the graph recovery: the first one guarantees that the coefficient 
functions change smoothly over time so that they can be approximated well 
locally with a constant; the second one guarantees that we can reliably sepa- 
rate the non-zero coefficients from the zero coefficients. We analyze the case 
where the number of observations is smaller than the number of parameters 
in the model, but the number of true relevant parameters is small, i.e., the 
graph structure is sparse. The conditions and the main theorem are given 
in Section 3. To establish the main theorem, we will need to show that the 
sample Fisher information matrix is concentrated around the true Fisher in- 
formation matrix associated with the conditional likelihood model for each 
node at the time point of interest. In Section 4 we state technical lemmas 
that establish these results, while the proofs are deferred to the appendix. 
Note that the results on the sample Fisher information matrix do not fol- 
low from the similar results given in [21] due to the non-i.i.d. nature of the 
samples. Section 5 is devoted to the proof of the main theorem. Concluding 
remarks are given in Section 6. 

2. Estimation Procedure. Let V n = {x* ~ ¥gt\t G T n } be a sample 
of n data points, which are sampled independently from Pgt at discrete 
time steps indexed by T n , where F e t is given in Eq. (1). The problem is to 
estimate the graph structure of the Markov random field associated with 
the distribution Pgr at any given time point r G [0,1]. We consider the 
parameter vector 6 T as a (^-dimensional vector, indexed by distinct pairs 
of nodes, of which an element is non-zero if and only if the corresponding 
edge (u,v) G E T . The problem of recovering the structure of the graph 
G T is now equivalent to estimating the non-zero pattern of the vector T , 
i.e., locations of non-zero elements of 6 T . A stronger notion of structure 
estimation is that of signed edge recovery; for a given graphical model G T 
with parameter r , we define the signed edge vector SE T G as: 



The classical problem of graph recovery can be seen as recovering the vector 
[SE^I of absolute values. The procedure we give next estimates the signed 
edge vector. 

Let r G [0, 1] be any given time point for which we are interested in 
estimating the structure of the graph G T from the sample T> n . Note that in 
order to get insight into the dynamics of the graph changes, one needs to 
estimate the graph G T at multiple values of r. Typically, in a real application 
task, one is interested in estimating G T for all r G T n . The estimation 
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procedure for one value of r is based on the recovery of the neighborhood of 
each node at that time based on all data points in T> n . It is simple to observe 
that recovering the signed edge vector SE T of a graph G T is equivalent to 
recovering, for each vertex u £ V, the set of neighboring edges S T (u) = 
{(u,v) : (u,v) G E T } and the correct signs sign(^„) for all (u,v) G S T (u). 
We define the set of signed neighboring edges as 

S^):={(sign(C),M) = («,«) 6 £»}. 

The set of signed neighboring edges S±(u) can be determined from the signs 
of elements of the (p — l)-dimensional subvector of parameters 

01 ■= K„ ■■ v G V\u} 

associated with vertex u. Under the model (1), the conditional distribution 
of X^ given other variables := {X^ : v G V\u} takes the form 

exp(2a£(0J,x( u )) 

(3) M<|X(„ = xg = eM2x r {e r }X r u)) + V 

where (a, b) = a'b denotes the dot product. For simplicity, we will write 
( x u\X-\ u = x \n) as ^^( X «I X \«)- Observe that the model given in (3) 
can be viewed as expressing X^ as the response variable in the generalized 
varying-coefficient models with playing the role of covariates. 

Under the model given in Eq. (3) the log- likelihood, for one data-point 
t G 7~ n , can be written in the following form: 

7(0„;x i )=logP 0u (x^|x t Vil ) 

(4) , 
= 4(0«> X \J - log (exp((0 u ,x* Xu )) + exp(-(6> u ,x i v 

For an arbitrary point of interest r G [0, 1], we define our estimator 0^ of 
the sign-pattern of the vector Q T U as the solution to the following convex 
program: 

(5) 6 T U = min Al(9 u -V n )+ KWOuWi} 
where 

(6) £(0 u ;T> n ) = - wll(du-y) 

is the weighted logloss, with weights defined as 

K h (t - t) 



Z t , e%l K h (t>-T) 
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Algorithm 1 Graph structure estimation 

Input: Dataset T>„, time point of interest r G [0, 1], penalty parameter A n , bandwidth 
parameter h 

Output: Estimate of the graph structure G T 
1: for all u G V do 

2: Estimate 8 U by solving the convex program (5) 

3: Estimate the set of signed neighboring edges S±(u) using (7) 

4: end for 

5: Combine sets {S±(u)} u ^v to obtain G T . 



and Kh(-) = K(-/h) is a symmetric nonnegative kernel. Note that in the 
objective given in (5) we approximate the function 0* : R i— > W -1 around the 
point r with a constant Q T U E M p_1 . The regularization parameter A n > is 
specified by a user and controls the sparsity of the solution. The program (5) 
is convex, but not differ entiable, because of the i\ norm. The minimum 
over 6 U is always achieved, as the problem can be cast as a constrained 
optimization problem over the ball ||#u||i < C(X n ) and the claim follows 
from the Weierstrass theorem. 

Let 6^ be a minimizer of (5). The convex program (5) does not necessarily 
have a unique optimum, but as we will prove shortly, in the regime of interest 
any two solutions will have non-zero elements in the same positions. Based 
on the vector 0^, we construct the estimate of the signed neighborhood: 

(7) Sl(u) := {(sign(C), (u,v)) : v E V\u, 9 T UV + o} . 

The structure of graph G T is consistently estimated if every signed neigh- 
borhood is recovered, i.e. S±(u) = S±(u) for all u E V. A summary of the 
algorithm is given in Algorithm 1. 

The convex program (5), can be solved using any general optimization 
solver. One particularly fast algorithm, based on the coordinate-wise descent 
method, for this type of a problem is described in [15] and implemented as 
the R package glmnet. 

3. Main theoretical result. In this section, we describe the properties 
of Algorithm 1. We give the conditions under which the estimation procedure 
estimates the unknown graph structure consistently. Using notation from the 
last section, we give conditions under which the estimator SE n satisfies the 
following 

P[SE T n = SE T ] -»• 1, as n -> +oo. 

This property is known as sparsistency. We will mainly be interested in the 
high-dimensional case, where the dimension p = p n is comparable or even 
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larger than the sample size n. It is of great interest to understand the per- 
formance of the estimator under this assumption, since in many real world 
scenarios the dimensionality of data is large. Our analysis is asymptotic and 
we consider the model dimension p = p n to grow at a certain rate as the 
sample size grows. This essentially allows us to consider more "complicated" 
models as we observe more data points. Another quantity that will describe 
the complexity of the model is the maximum node degree s = s n , which is 
also considered as a function of the sample size. The main result describes 
the scaling of the triple (n,p n ,s n ) under which the estimation procedure 
given in the previous section estimates the graph structure consistently. 

Since our main interest is in estimating the structure of a high-dimensional 
graph from a small size sample, we assume that the true structure of the 
graph is sparse, i.e., we assume that each node has a small number of ad- 
jacent edges. The t\ regularization procedures have been proved very suc- 
cessful as model selection techniques in a variety of problems, and, as we 
show here, our method is successful in estimating the time-varying graph 
structure. 

In order to estimate the non-zero pattern of the vector Q T U for each node 
u £ V we need to impose regularity conditions on the covariates in the 
model (3). We express these conditions in terms of the Hessian of the log- 
likelihood function as evaluated at the true model parameter, i.e., the Fisher 
information matrix. The Fisher information matrix QJ^ E k(p~ 1 ) x (p- 1 ) i s a 
matrix defined for each node u G V as: 



is the variance function. We write Q r := and assume that the following 
assumptions hold for each node u £ V. 

Al: Dependency condition There exist constants C m i n , -D m i n , D max > 



(8) 



E[v 2 io g p^[x u |xg] 

E[ ?? (X;^)X Vil X / v J, 



where 




4exp(2z M (fl M ,x\J) 



(exp(2x M (0 u ,x\ u )) + l) 2 



such that 




and 




where £ T = E#t [X r X r ']. Here A m i n (-) and A max (-) denote the mini- 
mum and maximum eigenvalue of a matrix. 
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A2: Incoherence condition There exists an incoherence parameter a G 
(0, 1] such that 

iQWCQ&r'loo^l-a, 
where, for a matrix A G R axb , the matrix norm is defined as 
I Ala, := max ie{lj ... ia} £$=i |dij-|. 

With some abuse of notation, when defining assumptions Al and A2, we use 
the index set 5 := S T (u) to denote nodes adjacent to the node u at time r. 
For example, if s = \S\, then Q55 G R sxs denotes the sub-matrix of Q r 
indexed by S. 

As in the structure estimation of the invariant MRF from an i.i.d. sam- 
ple [21], the Fisher information matrix Q r , associated with the local condi- 
tional probability, plays very important role in determining success of the 
method. It can also be regarded as a counterpart of the covariance matrix 
E[X r X T ] of Gaussian graphical models. Note that the conditions Al and 
A2 are symbolically the same as for the i.i.d. case, when the graph is invari- 
ant over time [21], with the difference that we assume that the conditions 
hold for the time point of interest r at which we want to recover the graph 
structure. Condition Al assures that the relevant features are not too cor- 
related. Condition A2 assures that the irrelevant features do not have to 
strong effect onto the relevant features. Similar type of condition has been 
proposed also in the case of the linear regression (e.g. [19]). 

Next, we assume that the distribution P^t changes smoothly over time, 
which we express in the following form, for every node u G V. 

A3: Smoothness conditions Let S* = [c^J. There exists a constant 
M > such that it upper bounds the following quantities: 

d d 2 
max sup l-p—cr* J < M, max sup |-— -^-cr* I < M 

u,vevxv te[0 ydt wevxvts^dt? 

d d 2 
max sup |— 9 UV \ < M, max sup |_0 | < M. 

u,vevxv te[0jl ] dt «.«evxv te[0)1 ] dt 1 

The condition A3 captures our notion of the distribution that changes 
smoothly over time. If we consider the elements of the covariance matrix and 
the elements of the parameter vector as a function of time, then these func- 
tions have bounded first and second derivatives. From these assumptions, it 
is not too hard to see that elements of the Fisher information matrix are 
also smooth functions of time. 

A4: Kernel The kernel K : R t— > R is a symmetric function, supported in 
[—1, 1], and there exists a constant Mr- > 1 which upper bounds the 
quantities m&x ze -^\K(z)\ and max^gR K(z) 2 . 



12 



KOLAR AND XING. 



This condition, A4, gives some regularity conditions on the kernel used 
to define the weights. For example, the assumption is satisfied by the box 
kernel K(z) = \ \{z E [—1, 1]}. Under the assumption A4, the kernel has 
the following properties: 



2 J zK(z)dz < 2 J K(z)dz = 1 
2 J z 2 K(z)dz < 1. 



With the assumptions made above, we are ready to state the theorem 
that characterizes the consistency of the method given in Section 2 for re- 
covering the unknown time- varying graph structure. An important quantity, 
appearing in the statement, is the minimum value of the parameter vector 
that is different from zero 

#min = riiin \0u V \- 

Intuitively, the success of the recovery should depend on how hard it is to 
distinguish the true non-zero parameters from noise. 

Theorem 1. Assume that the dependency condition Al holds with C m i n , 
D m i n and D max , that for each node u £ V , the Fisher information matrix 
Q T satisfies the incoherence condition A2 with parameter a, the smooth- 
ness assumption A3 holds with parameter M , and that the kernel function 
used in Algorithm 1 satisfies assumption A4 with parameter Mr- Let the 
regularization parameter satisfy 

for a constant C > independent of (n,p, s). Furthermore, assume that the 
following conditions hold: 

1. h = 0(n~\) 

2. s = o(nV3) ; -Lb|£ = (i) 



3> $iriin — ^ 



'mm — iL \ n i/3 J- 

Then for any r £ [0, 1], and in particular for r E T n , the estimated graph 
G T (\ n ) obtained through neighborhood selection satisfies 

(9) P [G T {X n ) ^G T ]=0 (^exp [~C^- + C logp^ - 0, 

for some constants C',C" independent of (n,p,s). 
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This theorem guarantees that the procedure in Algorithm 1 asymptoti- 
cally recovers the sequence of graphs underlying all the nodal-state measure- 
ments in a time series, and the snapshot of the evolving graph at any time 
point during measurement intervals, under appropriate regularization pa- 
rameter A n as long as the ambient dimensionality p and the maximum node 
degree s are not too large, and minimum 6 values do not tend to zero too 
fast. This is a somewhat surprising result because it suggests that structure 
recovery is possible when only one sample or even no sample exactly corre- 
sponding to the structure is available. The key insight behind this possibility 
is the smoothness assumption on graph evolution, which allows data points 
at, in theory, any time point (but in practice nearby time points determined 
by the kernel bandwidth) to contribute to the estimation of a graph at a 
particular time of interest. 

Remarks: 

1. The bandwidth parameter h is chosen so that it balances variance and 
squared bias of estimation of the elements of the Fisher information 
matrix. 

2. Condition 2 requires that the size of the neighborhood of each node 
remains smaller than the size of the samples. However, the model am- 
bient dimension p is allowed to grow exponentially in n. 

3. Condition 3 is crucial to be able to distinguish true elements in the 
neighborhood of a node. We require that the size of the minimum 
element of the parameter vector stays bounded away from zero. 

4. The rate of convergence is dictated by the rate of convergence of the 
sample Fisher information matrix to the true Fisher information ma- 
trix, as shown in Lemma 6. Using a local linear smoother, instead of 
the kernel smoother, to estimate the coefficients in the model (3) one 
could get a faster rate of convergence. 

In the sequel, we set out to prove Theorem 1. The plan is to first show that 
the empirical estimates of the Fisher information matrix and the covariance 
matrix are close elementwise to their population versions. Next, we show that 
the minimizer T U of (5) is unique under the assumptions given in Theorem 1. 
Finally, we show that with high probability the estimator 9^ recovers the 
true neighborhood of a node u. Repeating the procedure for all nodes u £ V 
we obtain the result stated in Theorem 1. 

4. Large deviation inequalities. In this section we characterize the 
deviation of elements of the sample Fisher information matrix Q r := at 
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time point r, defined as 

(10) Q T = x)^( xt ;^) x \« x \«« 

i 

and the sample covariance matrix S T from their population versions Q r and 
S 1 ". As will be seen later, in the proof of the main theorem, consistency result 
crucially depends on the bounds on the difference Q r — Q T and ~S T — ~S T . In 
the following, we use C, C and C" as generic positive constants independent 
of (n,p,s). 

4.1. Sample Fisher information matrix. To bound the deviation between 
elements of Q T = [q£ v i] and Q T = [q^ v i], v, v' G V\u, we will use the following 
decomposition: 

Wlv> - ilv'\ < I wlv(^; 01)44' ~ J2 <^(x*;0*)44'l 
ter n teT n 

(11) + I E <r7(x*;6>i)44, -E[E <^?(x 4 ; 6>* 

ter n ter n 

The following lemma gives us bounds on the terms in Eq. (11). 

Lemma 2. Assume that the smoothness condition A3 is satisfied and 
that the kernel function K{-) satisfies A4- Furthermore, assume 

max \{v G {l,...,p} : ^ ^ 0}| < a, 

te[o,i] 

i.e., i/ie number of non-zero elements of the parameter vector is bounded by 
s. There exist constants C, C, C" > 0, depending on M and Mk only, which 
are the constants quantifying assumption A3 and A4, respectively , such that 
for any r G [0, 1], we have 

(12) max \q T vv , - £ <? ? (x*; 6>* )x*x*,| = Csh 

teT n 

(13) max |E[E<^(x*;0*)4^']-C'l = C ' h - 

v ' v ter n 

Furthermore, 

(14) |^K^;^-EKr,(x ( ;&i)lX])l<f 
wi/i probability at least 1 — 2 exp(— C"nhe 2 ). 
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Using results of Lemma 2 we can obtain the rate at which the element- 
wise distance between the true and sample Fisher information matrix decays 
to zero as a function of the bandwidth parameter h and the size of neigh- 
borhood s. In the proof of the main theorem, the bandwidth parameter will 
be chosen so that the bias and variance terms are balanced. 

4.2. Sample covariance matrix. The deviation of the elements of the 
sample covariance matrix is bounded in a similar way as the deviation of ele- 
ments of the sample Fisher information matrix, given in Lemma 2. Denoting 
the sample covariance matrix at time point r as 

(15) S T = $>t T x t x*', 



and the difference between the elements of S T and 5] r can be bounded as 
Wuv ~ °"m)l I ^ ; w t x u x v ~ ®uv\ 

t&r n 

(i6) ^ I £ < x u x i - E l£ 



ter n ter n 

+ w t xt u x t]-<v\- 
ter n 

The following lemma gives us bounds on the terms in Eq. (16). 

Lemma 3. Assume that the smoothness condition A3 is satisfied and 
that the kernel function K{-) satisfies A4- There are constants C,C > 
depending on M and Mk only such that for any r G [0, 1], we have 

(17) max|E[£ - <J < Ch. 

teT n 

and 

(18) | J2 ™l« " E[£ wJxixlW < e 

teT n ter n 

with probability at least 1 — 2 exp(— C'nhe 2 ). 



A similar result was established in [29] for the case where x is a multi- 
variate Normal distributed random variable. 
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5. Proof of the main result. In this section we give the proof of 
Theorem 1. The proof is given through a sequence of technical lemmas. 
Note that in what follows, we use C, C and C" to denote positive constants 
independent of (n,p, s) and their value my change from line to line. 

The main idea behind the proof is to characterize the minimum obtained 
in Eq. (5) and show that the correct neighborhood of one node at an ar- 
bitrary time point can be recovered with high probability. Next, using the 
union bound over the nodes of a graph, we can conclude that the whole 
graph is estimated sparsistently at the time points of interest. 

We first address the problem of uniqueness of the solution to (5). Note 
that because the objective in Eq. (5) is not strictly convex it is necessary 
to show that the non-zero pattern of the parameter vector is unique, since 
otherwise the problem of sparsistent graph estimation would be meaningless. 
Under the conditions of Theorem 1 we also have that the solution is unique, 
which we prove in two steps. 

Let us denote the set of all solution to (5) as 0(A n ). We define the objec- 
tive function in Eq. (5) by 

(19) F(6 U ) :=- £ <l(0u;^) + K\\0u\\i 

teT n 

and we say that 6 U G R p_1 satisfies the system (S) when 
(20) 

w _ i „ _ -I J ^ter n <(V7(0 U ; x*))„ = A n sign(0 u „) if UV / 
'1 |Et e r n <(V7(0 u ;x*)),|<A„ if0 w = O, 

where 

(21) V 7 (0«;x*) = x[ n {4 + 1 - 2P J4 = l| X y} 

is the score function. Eq. (20) is obtained by taking the sub-gradient of F(6) 
and equating it to zero. From the Karush-Kuhn- Tucker (KKT) conditions 
it follows that U G IR P_1 belongs to @(A n ) if and only if U satisfies the 
system (S). The following Lemma shows that any two solutions have the 
same non-zero pattern. 

Lemma 4. Consider a node u G V . If 6 U £ W" 1 and 6 U G W' 1 both 
belong to 0(A n ) then (x* u ,0 u ) = (x^ u ,0 u ), t G T n . Furthermore, solutions 

6 U and U have non-zero elements in the same positions. 

We now use the result of Lemma 4 to show that with high probability the 
minimizer in (5) is unique. We consider the following event: 

«01 = {Anin - S < y'± T SS y < Anax + 6 : y G R s ,\ |y| | 2 = 1}. 
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Lemma 5. Consider a node u G V . Assume that the conditions of 
Lemma 3 are satisfied. Assume also that the dependency condition Al holds. 
There are constants C, C, C" > depending on M and M% only, such that 

r 

P[fioi] > 1 - 4exp(-CWi(- - C'hf + C"log(s)). 
Moreover, on the event Ooi> the minimizer of (5) is unique. 

We have shown that the estimate 0^ is unique on the event Ooi, which 
under the conditions of Theorem 1 happens with probability converging to 1 
exponentially fast. To finish the proof of Theorem 1 we need to show that the 
estimate 6^ has the same non-zero pattern as the true parameter vector 6^. 
In order to show that we consider a few "good" events, which happen with 
high probability and on which the estimate 0^ has the desired properties. We 
start by characterizing the sample version of the Fisher information matrix, 
defined in Eq. (10). Consider the following events: 

O 02 := {Cmin - S < y'Q5 5 y : y e M s , ||y|| 2 = 1} 

and 

^03 := {|||QW(Qssr% < 1 - f }■ 

Lemma 6. Assume that the conditions of Lemma 3 are satisfied. Assume 
also that the dependency condition Al holds and the incoherence condition 
A 2 holds with the incoherence parameter a. There are constants C, C , C" > 
depending on M , Mk and a only, such that 

P[f\, 2 ] > 1 - 2exp(-C^- + C'log(a)) 

and 

7i h 

P[Q 03 ] > 1 - exp(-C— + C"log(p)). 

s^ 

Lemma 6 guarantees that the sample Fisher information matrix satisfies 
"good" properties with high probability, under the appropriate scaling of 
quantities n,p, s and h. A similar result was obtained for the sample Fisher 
information matrix in [21] for the model that does not change with time. 
Note that the result in Lemma 6 is somewhat harder to obtain since it 
heavily relies on the results of Lemma 2. 

We are now ready to analyze the optimum to the convex program (5). To 
that end we apply the mean-value theorem coordinate-wise to the gradient 
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of the weighted logloss J2teT ^t" V7(# u ; x*) and obtain 
(22) 

£ u£(V 7 (&**) - V 7 (fl£;x*)) = [£ <V 2 7 (^;x*)](^ - flj) + A T , 

ter n ter„ 

where A T € R p_1 is the remainder term of the form 

(23) Al = <(V 2 7 (^;x*) - V 2 7 (^;x'))]U^ - flj) 

ter n 

and is a point on the line between 6^ and and [■](, denoting the u-th 
row of the matrix. Recall that Q r = 2~^teT n w t^ 2 l(@u'i x< )- Using the expan- 
sion (22), we write the KKT conditions given in Eq. (20) in the following 
form, Vf = 1, . . . ,p — 1, 
(24) 

Q;(0 U - 91) + £ teTn w[(V7(ej;x*))« + A; = A n sign(M if 9 UV ± 
\Ql(O u - 01) + £ teTn <(V 7 (0£;x% + AJ| < A n if ra = 0. 

We consider the following events 

= ^oi n ^02 n rio3, 

f) 1 = {Vt;eS : |A n ((QS s )- 1 S ign(05))„-((Q5 s )- 1 W^|<|CI} 

and 
where 

W^= E<V 7 (^;x*) + A^. 

We will work on the event Q,q on which the minimum eigenvalue of Q55 is 
strictly positive and, so, Q T SS 1S regular and H fli and £Iq n are well 
defined. 

Proposition 7. Assume that the conditions of Lemma 6 are satisfied. 
The event 

{\/0 T u £ W- 1 solution of (S), we have sign(0;) = sign(0;)} n n 
contains event S7q H S~2i D ^2- 
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Proof. We consider the following linear functional 
j R s -> R s 

: j e _> e-^ + ^-iWS-Anfe^sign^). 

For any two vectors y = (yi, . . . , y s )' G M s and r = (n, . . . , r s )' G R+, define 
the following set centered at y as 

s 

i=i 

Now, we have 

G (B(0£, \6l\)) = B ((QSs) _1 WS - KifrssT 1 sign(05), l«s 
On the event f^o H f2i, 

G B f (QSs)" 1 ^ - A^fe)" 1 sign(05), |0£|) , 



which implies that there exists a vector 6 T S G B(Og, \0g\) such that G(6g) = 
0._ For it holds that 6 T S = 6 T S + A^Q^)" 1 sign(0£) - (Q5 S ) _1 W5 and 
\0 T S — 6 T S \ < \0g\. Thus, the vector T S satisfies 

sign(0£) = sign(0£) 

and 

(25) Qss(0 T s - e T s ) + W£ = A n sign(0£). 

Next, we consider the vector T = ( ^ ) where Q T S c is the null vector 



6 T SC 



of W 1 s . On event CIq, from Lemma 6 we know that ||Qsc S (Qss) 1 || o < 

1 — §■ Now, on the event f^o n it holds 

(26) 

H-QW(Qs5) -1 WS + W^ + A n Q5c 5 (Q^)- 1 sign(05)|| oo < A„, 

Note that for r , equations (25) and (26) are equivalent to saying that 6 T 
satisfies conditions (24) or (20), i.e., saying that 9 T satisfies the KKT condi- 
tions. Since sign(0£) = sign(#J), we have sign(0 T ) = sign(#J). Furthermore, 
because of the uniqueness of the solution to (5) on the event Qq > we conclude 
that 61 = 6 T . □ 
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Proposition 7 implies Theorem 1 if we manage to show that the event 
0,q n n SI2 occurs with high probability under the assumptions stated in 
Theorem 1. Proposition 8 characterizes the probability of that event, which 
concludes the proof of Theorem 1. 

Proposition 8. Assume that the conditions of Theorem 1 are satisfied. 
Then there are constants C,C > depending on M, Mk, D max , C m i n and 
a only, such that the following holds: 

(27) P[fi nOinfi 2 ]>l-2 exp(-Cnh(X n - sh) 2 + log(p)). 

Proof. We start the proof of the proposition by giving a technical lemma, 
which characterizes the distance between vectors 6^ = 6 T and 0^ under the 
assumptions of Theorem 1, where 6 T is constructed in the proof of Propo- 
sition 7. The following lemma gives a bound on the distance between the 
vectors 6 T S and 6 T S , which we use in the proof of the proposition. The proof 
of the lemma is given in Appendix. 

Lemma 9. Assume that the conditions of Theorem 1 are satisfied. There 
are constants C,C > depending on M, Mr, D maX; Cmin an d a only, such 
that 



(28) ||9S_ e5 || 2 < C vpi 

with probability at least 1 — exp(— C' logp). 

Using Lemma 9 we can prove Proposition 8. We start by studying the 
probability of the event • We have 

n$ c U„ 6S c{W„ + (Q T S o S (Q T ss )- l W T s ) v > |A n }. 
Recall that W T = J2t&T n w l Vt(0£;x*) + A r . Let us define the event 

U 3 = { max K £ <V 7 (^;x t )| < j^-A, 
l<v<p-l ^ 4(2 - a) 

where e v G is a unit vector with one at the position v and zeros 

elsewhere. From the proof of Lemma 9 available in the appendix we have 
that P[^3] > 1 — 2exp(— Clog(p)) and on that event the bound given in 
Eq. (28) holds. 
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On the event Q3, we bound the remainder term A T . Let j : R w 1 
be defined as g(z) = ^^(2%)^ ■ Then VfcOu) = 9(x u (6 u ,x.\ u )). For v G 
{1, . . . ,p — 1}, using the mean value theorem it follows that 

A, = [^<(V 2 7 (^);x*)-V 2 7 (^;x*))];(^-^) 

ter n 
teT n 

= E wl{g\xi{¥:\^ u ))xixl}{[e^ - 0l]'x\j{ u [0l - 61]}, 

= (v ) 

where is another point on the line joining 6^ and 6^. A simple calculation 
shows that \g\x t u (8u\xt))x t u x t v \ < 1, for all t G T n , so we have 



(29) 



A v \ < 


[e { : ] - 




< 


Wa- 


^]'{E< x \« x \J[^-^] 

ter„ 




rn- 




< 


^max 


II /it QT\\2 



Combining the equations (29) and (28), we have that on the event 

max IAJ < CA?,s < 



i<v< P -V u < ~ n 4(2 -a) 

where C is a constant depending on -D max and C m \ n only. 
On the event Oo H O3, we have 

wi + (q;. s( q; s )-w;,„ < ^ + d - < ^ 

and we can conclude that P[^2] > 1 — 2 exp(— C log(p)) for some constant 
C depending on M, Mr, C m i n , -D max and a only. 

Next, we study the probability of the event VL\. We have 

(30) n? C U,, eS {A n ((Q5 s )- 1 S ign(0£))„ + ((QhT'W^v > CI- 
Again, we will consider the event O3. On the event Oo H O3 we have that 

(31) A n ((Q5 s )- 1 sign(^)), ; + ((Q5 s )- 1 W5), < ^ + < C\ n ^s, 
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for some constant C. When m [ n > C\ n yfs, we have that P[f2i] > 1 — 
2exp(— Clog(p)) for some constant C that depends on J\d, Cmin? ^max 
and a only. □ 

In summary, under the assumptions of Theorem 1, the probability of event 
QoD^iD^ converges to one exponentially fast. On this event, we have shown 
that the estimator Q T a is the unique minimizer of (5) and that it consistently 
estimates the signed non-zero pattern of the true parameter vector 9^, i.e., 
it consistently estimates the neighborhood of a node u. Applying the union 
bound over all nodes u £ V, we can conclude that our estimation procedure 
explained in Section 2 consistently estimates the graph structure at a time 
point r. 

6. Conclusion. In the paper we focus on sparsistent estimation of the 
time-varying high-dimensional graph structure in Markov Random Fields 
from a small size sample. This work builds on the previous work of Raviku- 
mar et al. [21] which analyzed the invariant graph structure estimation from 
the i.i.d. sample. Our main contribution is the new nonparametric proce- 
dure applicable to much more realistic scenarios in which the data generating 
process changes over time and the sufficient conditions under which our pro- 
cedure succeeds in recovering the unknown time-evolving graph structure. 

An interesting open direction is estimation of the graph structure from a 
general time-series, where observations are dependant. In our opinion, the 
graph structure that changes with time creates the biggest technical diffi- 
culties. Incorporating dependant observations would be an easier problem 
to address, however, the one of great practical importance, since samples in 
the real data sets are likely to be dependant. Another open direction is to es- 
tablish necessary conditions, to complement sufficient conditions established 
here, under which it is possible to estimate a time-varying graph structure. 
In this work we have assumed that the distribution changes smoothly, an 
assumption more realistic than the assumption that the distribution is time- 
invariant, however, it may still be invalid for real datasets. Under the as- 
sumption that the distribution generating observations undergoes a number 
of shifts, it would be interesting to modify the current approach to allow for 
use of a change-point detection framework to estimate a sequence of graph 
structures. 

7. Appendix. Note that in what follows, we use C, C and C" to denote 
positive constants and their value may change from line to line. 
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7. 1 . Proof of Lemma 2. We start the proof by bounding the difference 
|?7(x; 0„ +<s ) — t?(x; which will be useful later on. By applying the mean 
value theorem to ?](x; •) and the Taylor expansion on l u we obtain: 

|r?(x; O - r?(x; fl* )| = | £(0^ " 

v=l 

p-1 

— / y I "mi uv 
v=\ 

P-1 



0^ is a point on the line 
between and 



(|r/(x;.)|<l) 







<5 2 9 2 



l^g^W + 9 ^2 



1>=1 



2 9t 2 



/3„ is a point on the line 
between t and t + 5 



Without loss of generality, let r = 1. Using the above equation, and the 
Riemann integral to approximate the sum, we have 

|K(^I)[ 7? (x-;^)-r ? (x^;^)]^^l 



/? 



ft, 

,0 P- 1 f) 

<2 K(z')[J2\z'hU 

J - 1 v=l UL 



(z'h) 2 d 2 



< Csh, 



for some constant C > depending on M from A3 which bounds the deriva- 
tives in the equation above, and Mk from A4 which bounds the kernel. The 
last inequality follows from the assumption that the number of non-zero 
components of the vector l u is bounded by s. 

Next, we prove equation (13). Using the Taylor expansion, for any fixed 
1 < v, v' < p — 1 we have 

mJ2 wlvi^^DxWv'] ~ llv>\ 



teT n 



vv' 



£ <((*- r )^w 

teT n 



+ 



(t - r) 2 d 2 , 



2 dt 2 

where £ G [i, r]. Since w[ = for |i — t| > /i, we have 



max|E[£ wl^-OlKA'] ~ lU < C'h 

v,v' ~~ 

teT n 
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for some constant C > depending on M and M% only. 

Finally, we prove equation (14). Observe that wlrj(x ; 6 U )x v x v , are in- 
dependent and bounded random variables [—wJjWj]. The equation simply 
follows from the Hoeffding's inequality. □ 



7.2. Proof of Lemma 3. To obtain the Lemma, we follow the same proof 
strategy as in the proof of Lemma 2. In particular, Eq. (17) is proved in the 
same way as Eq. (13) and Eq. (18) in the same way as Eq. (14). The details 
of this derivation are omitted. □ 

7.3. Proof of Lemma 4- The set of minima 0(A n ) of a convex function 
is convex. So, for two distinct points of minima, 9 U and G u , every point on 
the line connecting two points also belongs to minima, i.e. £0 U + (1 — £)0 U € 
0(A n ), for any £ G (0, 1). Let rj = 9 U — 9 U and now any point on the line can 
be written as 9 U + £77. The value of the objective at any point of minima is 
constant and we have 

F(9 u + Cv) = c, £€(0,1), 

where c is some constant. By taking the derivative with respect to £ of 

F(9 U + Cv) we obtain 

(32) 



ter n 



exp((0 u + £T7,xp) - exp(-(0 u + £t7,x^)) 



V 



exp((0 u + £r?, xM) + exp(-(0 u + £77, xM) 



P-l 



+ A n Vv sign(0 u „ + £??,,) = 0. 



On a small neighborhood of £ the sign of 9 U + £?7 is constant, for each 
component v , since the function 9 U + £ry is continuous in £. By taking the 
derivative with respect to £ of Eq. (32) and noting that the last term is 
constant on a small neighborhood of £ we have 



4 w t(V,X\u) 



t£T n 



exp(-2(0 M + £T7,x* M )) 
l + exp(-2(0 u + £r 7 ,x* )) 



0. 



This implies that (77, x* u ) = for every t € T n , which implies that (x' u , 9 U 



V 



(x* u , G u ), t G T n , for any two solutions G u and 9 U . Since 9 U and U were two 
arbitrary elements of 0(A n ) we can conclude that (x* u ,0 n ), t £ T n is con- 
stant for all elements 9 U G 0(A n ). 
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Next, we need to show that the conclusion from above implies that any 
two solutions have non-zero elements in the same position. Prom equation 
(20), it follows that the set of non-zero components of the solution is given 
by 



S= jl < v<p-l : 
Using equation (21) we have that 

$>t(V7(<£; **))„ = 



£<(v 7 (0u;x*))„ 

teT n 



A 



teT n 



exp(2x^(^,x^)) 
exp(24<0£,xp) + l 



which is constant across different elements 6 U G 0(A n ), since (x* u ,0 u ), i G 
T n is constant for all 6 U G 0(A n ). This implies that the set of non-zero 
components is the same for all solutions. □ 

7.4. Proof of Lemma 5. Under the assumptions given in the Lemma, 
we can apply the result of Lemma 3. Let y £ K s be a unit norm minimal 
eigenvector of We have 

Amines) = min x'Sj s x 

IM|2=1 

= min {x'ES s x + x'(ES s -i£ s )x} 

||x|| 2 =l 

which implies 

A m in(£s\s) ^ ^min ~ K^SS ~ ^Ss)h- 

Let 5] T = [cr^J and S r = [crjj. We have the following bound on the spectral 
norm 

/ s s \ V2 

||s5s - iiSsh < E XK« - <vf < s, 

\ u =l„=l / 

with the probability at least 1 — 2 exp (-Cnh{^-C'h) 2 + C"\og{s)), for some 
fixed constants C, C , C" > depending on M and Mk only. 
Similarly, we have that 

Amax^gg) < -Dmax + 
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with probability at least 1 — 2exp(— Cnh(^ — C'h) 2 + C"log(s)), for some 
fixed constants C, C", C" > depending on M and Mk only. 

From Lemma 4, we know that any two solutions 9 U ,9 U G 0(A n ) of the 
optimization problem (5) have non-zero elements in the same position. So, 
for any two solutions 6 U ,6 U G 6(A n ), it holds 

X\ u (0« — U ) = X\ U>S (0 U — 6 u )s + X.\ UtS c(6 u — O u )sc = ~K\ UtS (0 u - 6 u )s- 

Furthermore, from Lemma 4 we know that the two solutions are in the 
kernel of X\ Uj s. On the event Ooij kernel of Xwg is {0}- Thus, the solution 
is unique on f2oi- D 

7.5. Proof of Lemma 6. We first analyze the probability of the event 
f2o2- Using the same argument to those in the proof of Lemma 5, we obtain 

-A-min(QsS') — Cmin — IIQs5 ~~ Qssllb- 

Next, using results of Lemma 2, we have the following bound 

1 /2 

(33) |||QS 5 - dl s h < (j2 £(&, " ID 2 ) < 5, 

\«=i«=i / 

with probability at least 1 — 2exp(— — h 21og(s)), for some fixed con- 
stants C, C > depending on M and Mk only. 

Next, we deal with the event Slo3- We are going to use the following 
decomposition 

Qs c s(Qss) 1 = Qs ,c s'[(Qs'5) 1 ~~ (Q55) ] 

+ fe 5 - QWlKQSs)" 1 - (Q55)" 1 ] 
+ Qs c s(Qss) 1 

= T 1 +T 2 +T 3 + T±. 

Under the assumption A2, we have that iT^loo < 1 — a. The lemma 
follows if we prove that for all the other terms we have ||| - loo < ^. Using 
the submultiplicative property of the norm, we have for the first term: 

11^1 III 00 < IQ^S (Qss) 1 llloo|||Qs'5 — QssJoollKQss) 1 llloo 

< (I-^IIIQ^-Q^IIIoov^IIKQSs)" 1 !^- 

Using Eq. (33), we can bound the term ||| ^QJsj I2 < C" , for some con- 
stant depending on C m i n only, with probability at least 1 — 2exp(-C^ + 
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21og(s)), for some fixed constant C > 0. The bound on the term IQ55 — 
Qssll°° f 01 l ows horn application of Lemma 2. Observe that 

niQss - Q55IIU >S\ = P[max{ £ \qr vv , _ ^,| } > § ] 

(35) l " eS 

< 2eM-Cnh(- -C'sh) 2 + 2\og(s)), 
s 

for some fixed constants C, C > 0. Combining all the elements, we ob- 
tain the bound on the first term |||Ti ||| oo < ^, with probability at least 
1 - Cexp(C"f£ + C"log(s)), for some constants C, C, C" > 0. 
Next, we analyze the second term. We have that 

1^2 II 00 < IQ^S ~~ Qs c slll°o\/s||| (Qss) 1 III 2 

( 36 ) . j/i 



^ T< I Qs c s ~ Qs c s III 00 • 



The bound on the term IQ55 — Q55II00 follows in the same way as the bound 
in Eq. (35) and we can conclude that l^loo < ^ with probability at least 
1 - Cexp(C"f£ + C"log(p)), for some constants C, C , C" > 0. 

Finally, we bound the third term T3. We have the following decomposition 

lllfcs - Q^s][(Qs5) _1 - (Q5s) _1 ]lloo 

^ IIIQs c S ~ Qs c slll°°V^|||(Qss) _1 [QsS _ Qssl (Qss)" 1 III 2 

- III Qs c S ~ Qs c S III 00 III Q>SS ~~ QsS III 2 III (Q,SS ) 1 III 2 • 

Bounding the remaining terms as in equations (36), (35) and (34), we obtain 
that I T 3 loo < § with probability at least 1 - Cexp(C"f£ + C"log(p)). 

Bound on the probability of event follows from combining the bounds 
on all terms. □ 

7.6. Proof of Lemma 9. To prove this Lemma, we use a technique of 
Rothman et al. [23] applied to the problem of consistency of the penalized 
covariance matrix estimator. Let us define the following function 



H : 



d ^ F(ei + B)-F(ei), 



where the function F(-) is defined in equation (19). The function H(-) takes 
the following form 

H(B) = £ <( 7 (^;x*)- 7 (^ + D;x*)) 

t£T n 
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Recall the minimizer of (5) constructed in the proof of Proposition 7, 
Q T U = (0' s , 0' S c)'. The minimizer of the function H(-) is D = 6^ — 9^. Function 
H(-) is convex and H(0) = by construction. Therefor H(D) < 0. If we 
show that for some radius B > 0, and Del p with | |D| \ 2 = B and D^c = 0, 
we have H(D) > 0, then we claim that ||D||2 < B. This follows from the 
convexity of H(-). 

We proceed to show strict positivity of H(-) on the boundary of the ball 
with radius B = KX n ^/s, where K > is a parameter to be chosen wisely 
later. Let D G W be an arbitrary vector with ||D||2 = B and D^c = 0, then 
by the Taylor expansion of 7 (-;x*) we have 

H(D) = -(X>rv 7 (0£;x*))'D 

ter n 

(37) " D'[ £ wlvtf ; 01 + «D)x^x(jD 

+ A n (||^ + D|| 1 -||^|| 1 ) 
= (/) + (//) + (///), 

for some a G [0, 1]. 

We start from the term (/). Let e„ G W be a unit vector with one at the 
position v and zeros elsewhere. Then random variables —e' v J2t&T n w t^l{0u'-> x *) 
are bounded [— Q\ for all 1 < f < p— 1, with constant C > depending 
on M# only. Using the Hoeffding inequality and the union bound, we have 

max K(X <V 7 (0£;x*) -E[X <V 7 (^;x t )])| < S, 

Kv<p— 1 ~~ ~~ 

with probability at least 1 — 2exp(— CnM 2 + log(p)), where C > is a 
constant depending on Mk only. Moreover, denoting 

p(0*)=p t[4 = i| x y 

to simplify the notation, we have for all 1 < v < p — 1, 

iek E< v ^ x *) I -t x \«WJl 

= |E[]T <4[4 + 1 " MOD] I {x^WJI 

ter n 
r0 



(38) 



< ±[K{z)\p{ei+ zh )-p(Ol)\dz. 
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Next, we apply the mean value theorem on p(-) and the Taylor's theorem 

on 0* . Under the assumption A3, we have 

(39) 

Wl +Zh ) - p{0l)\ 

< E \ ut zh - ci dp'(-)i<i) 

v=l 

= E \* m <L ^ + K -i~ W <L t =J ( «■ 6 [r + zh,r] ) 

v=l 

< Cs\zh + ^-\, 

for some C > depending only on M. Combining (39) and (38) we have 
that |E[e^ tgTn u> t T V7(^;x i )| < Csh for all 1 < v < p - 1. Thus, with 
probability greater than 1— 2exp(— Cnh(X n — s/i) 2 +log(p)) for some constant 
C > depending only on Mk,M and a, which under the conditions of 
Theorem 1 goes to 1 exponentially fast, we have 



max K £<V 7 K;x*)|<-- 
i<«<p-i rrp 4(2 — a) 4 

On that event, using Holder's inequality, we have 

|(E <V 7 (flJ;x*))'D| < IIDHi max |< £ <V 7 (0£;x*)| 
ter n — ^ ter n 

<^l|D|| 2 <(A„v^) 2 j. 

The triangle inequality applied to the term (///) of equation (37) yields: 

AnGK + DHi-ll^lli) > -A„||D s ||i 

> -A n v^||Ds||2 > -K(X n ^) 2 . 

Finally, we bound the term {II) of equation (37). Observe that since Ds c '■ 
0, we have 

D'[ E <^(x*; OS + aD)x( u x(jD 

= D' s [^u;[7 ? (x t ;^ + aD)x t s x|]D s 

*GT n 

> i*T 2 A min ( E <v(A ^ + «D)x* s x^) 

teT n 



a\ n , A r 



< — . 
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Let 5 : 1h M. be defined as g(z) = jp^^^i- Now, 77 (x; U ) = g(x u {6 u , x\ u )) 
and we have 



Amin( E w Mx f ; oi + «d)44) 



> min A min ( V ^(x* ; T U + aD)x^x|) 

ae[0 ' 1] ter n 



> 



A mi n(E<^ xi ;^)44; 



m r ^, III E ^V(4(^ + aD ) x* s ))(x^D / 5 x i 5 )x* s x t ;| 



• C min - max If ^ <g / (x^(^ + aD,x^))(4D^x t 5 )x* s x^| 2 



To bound the spectral norm, we observe that for any fixed a € [0, 1] and 
y £ M s ,||y||2 = 1 we have: 

y'{ E wlg'(xi(0l + aD,4))(^D' s x^x£}y 

t&T n 

= E < 5 '^ l (^ + aD,x* 5 ))(4D' s x* s )(x^y) 2 

< £ <| 5 '^(^ + aD,x i 5 ))(x t lt D' s x t s )|(x^y) 2 

ter n 

<^I|D|| 2 |E^4| 2 ( < 1 ) 



S~1 

The last inequality follows as long as X n s < 2D min K ■ We have shown that 

A min ( E <^( x< ; ° T u + > 

i€T„ 

with high probability. 

Putting the bounds on the three terms together, we have 

H(B) > (A n ^) 2 {"^ + %^ 2 -^}, 
which is strictly positive for K = r , 5 . For this choice of K, we have that 

^ mi n 

c 2 . 

X-nS < in n in , which holds under the conditions of Theorem 1 for n large 
enough. □ 
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